{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "import os\n",
    "if not any(path.endswith('textbook') for path in sys.path):\n",
    "    sys.path.append(os.path.abspath('../../..'))\n",
    "from textbook_utils import *"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Examples of Text and Tasks\n",
    "\n",
    "For each type of task just introduced, we provide a\n",
    "motivating example. These examples are based on real tasks that we have carried\n",
    "out, but to focus on the concept, we've reduced the data to snippets.\n",
    "\n",
    "## Convert Text into a Standard Format\n",
    "\n",
    "Let's say we want to study connections\n",
    "between population demographics and election results.\n",
    "To do this, we've taken election data from Wikipedia and population data from the US Census Bureau.\n",
    "The granularity of the data is the county level, and we need to use the county names to join the tables.\n",
    "Unfortunately, the county names in these two tables don't always match:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "election = pd.DataFrame({\n",
    "    'County': ['De Witt County', 'Lac qui Parle County',\n",
    "               'Lewis and Clark County', 'St John the Baptist Parish'],\n",
    "    'State': ['IL', 'MN', 'MT', 'LA'],\n",
    "    'Voted': ['97.8', '98.8', '95.2', '52.6'],\n",
    "})\n",
    "census = pd.DataFrame({\n",
    "    'County': ['DeWitt  ', 'Lac Qui Parle', 'Lewis & Clark',\n",
    "               'St. John the Baptist'],\n",
    "    'State': ['IL', 'MN', 'MT', 'LA'],\n",
    "    'Population': ['16,798', '8,067', '55,716', '43,044'],\n",
    "})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "        <div style=\"display: flex; gap: 1rem;\">\n",
       "        <table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>County</th>\n",
       "      <th>State</th>\n",
       "      <th>Voted</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>De Witt County</td>\n",
       "      <td>IL</td>\n",
       "      <td>97.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Lac qui Parle County</td>\n",
       "      <td>MN</td>\n",
       "      <td>98.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Lewis and Clark County</td>\n",
       "      <td>MT</td>\n",
       "      <td>95.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>St John the Baptist Parish</td>\n",
       "      <td>LA</td>\n",
       "      <td>52.6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table><table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>County</th>\n",
       "      <th>State</th>\n",
       "      <th>Population</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>DeWitt</td>\n",
       "      <td>IL</td>\n",
       "      <td>16,798</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Lac Qui Parle</td>\n",
       "      <td>MN</td>\n",
       "      <td>8,067</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Lewis &amp; Clark</td>\n",
       "      <td>MT</td>\n",
       "      <td>55,716</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>St. John the Baptist</td>\n",
       "      <td>LA</td>\n",
       "      <td>43,044</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "        </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "dfs_side_by_side(election, census)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can't join the tables until we clean the strings to have a common format for county names. We need to change the case of characters, use\n",
    "common spellings and abbreviations, and address punctuation."
   ]
  },
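  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of what this cleaning might involve, the following function lowercases a name, replaces `&` with `and`, drops periods, removes a trailing `county` or `parish`, and deletes the remaining spaces. The function name `canonical_county` and these particular rules are our own assumptions, chosen only to fit the four rows shown here; a full dataset would likely need more rules.\n",
    "\n",
    "```python\n",
    "def canonical_county(name):\n",
    "    # Hypothetical helper: the rules are illustrative, not exhaustive\n",
    "    name = name.lower().strip()\n",
    "    name = name.replace('&', 'and').replace('.', '')\n",
    "    # Drop a trailing ' county' or ' parish' if present\n",
    "    for suffix in (' county', ' parish'):\n",
    "        if name.endswith(suffix):\n",
    "            name = name[:-len(suffix)]\n",
    "    # Remove remaining spaces so 'de witt' and 'dewitt' agree\n",
    "    return name.replace(' ', '')\n",
    "\n",
    "election_names = ['De Witt County', 'Lac qui Parle County',\n",
    "                  'Lewis and Clark County', 'St John the Baptist Parish']\n",
    "census_names = ['DeWitt  ', 'Lac Qui Parle', 'Lewis & Clark',\n",
    "                'St. John the Baptist']\n",
    "\n",
    "cleaned_election = [canonical_county(n) for n in election_names]\n",
    "cleaned_census = [canonical_county(n) for n in census_names]\n",
    "print(cleaned_election == cleaned_census)\n",
    "```\n",
    "\n",
    "With these rules, the two lists of cleaned names agree row by row, so the cleaned strings could serve as a join key for these four counties."
   ]
  },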
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extract a Piece of Text to Create a Feature\n",
    "\n",
    "Text data sometimes has a lot of structure, especially when it was generated\n",
    "by a computer.\n",
    "As an example, the following is a web server's log entry.\n",
    "Notice how the entry has multiple pieces of data, but the pieces don't have\n",
    "a consistent delimiter. For instance, the date appears in square brackets,\n",
    "but other parts of the data appear in quotes and parentheses:\n",
    "\n",
    "```\n",
    "169.237.46.168 - -\n",
    "[26/Jan/2004:10:47:58 -0800]\"GET /stat141/Winter04 HTTP/1.1\" 301 328\n",
    "\"http://anson.ucdavis.edu/courses\"\n",
    "\"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)\"\n",
    "```\n",
    "\n",
    "Even though the file format doesn’t align with one of the simple formats we saw\n",
    "in {numref}`Chapter %s <ch:files>`, we can use text processing techniques to\n",
    "extract pieces of text to create features."
   ]
  },
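  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to pull out a piece such as the date is to locate the square brackets and slice the text between them. The following is only a sketch that uses Python's built-in string methods; the variable names are our own, and we've joined the first part of the log entry into a single string for convenience.\n",
    "\n",
    "```python\n",
    "log_entry = ('169.237.46.168 - - '\n",
    "             '[26/Jan/2004:10:47:58 -0800]'\n",
    "             '\"GET /stat141/Winter04 HTTP/1.1\" 301 328')\n",
    "\n",
    "# The timestamp sits between the first '[' and the first ']'\n",
    "start = log_entry.find('[') + 1\n",
    "end = log_entry.find(']')\n",
    "timestamp = log_entry[start:end]\n",
    "\n",
    "# Day, month, and year are separated by '/', and the time follows a ':'\n",
    "day, month, rest = timestamp.split('/')\n",
    "year = rest.split(':')[0]\n",
    "print(day, month, year)\n",
    "```\n",
    "\n",
    "Slicing by position works for this one entry, but it relies on the brackets always being present and appearing in this order."
   ]
  },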
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transform Text into Features\n",
    "\n",
    "In {numref}`Chapter %s <ch:wrangling>`, we\n",
    "created a categorical feature based on the content of the strings. There, we examined the\n",
    "descriptions of restaurant violations and we created nominal variables for the\n",
    "presence of particular words.\n",
    "We've displayed a few example violations here:"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "unclean or degraded floors walls or ceilings\n",
    "inadequate and inaccessible handwashing facilities\n",
    "inadequately cleaned or sanitized food contact surfaces\n",
    "wiping cloths not clean or properly stored or inadequate sanitizer\n",
    "foods not protected from contamination\n",
    "unclean nonfood contact surfaces\n",
    "unclean or unsanitary food contact surfaces\n",
    "unclean hands or improper use of gloves\n",
    "inadequate washing facilities or equipment\n",
    "```\n",
    "\n",
    "These new features can be used in an analysis of food safety scores."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Previously, we made simple features that marked whether a description contained\n",
    "a word like _glove_ or _hair_. In this chapter, we more formally introduce the regular expression tools that we used to create these features."
   ]
  },
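  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the idea concrete, here is a minimal sketch that marks whether each description contains a given word, using `re.search` from Python's standard library. The three descriptions come from the examples in this section; the words `glove` and `surface` and the variable names are our own illustrative choices, not necessarily the ones used in the earlier chapter.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "violations = [\n",
    "    'unclean hands or improper use of gloves',\n",
    "    'unclean nonfood contact surfaces',\n",
    "    'foods not protected from contamination',\n",
    "]\n",
    "words = ['glove', 'surface']  # illustrative choices\n",
    "\n",
    "# For each word, one True/False value per description marking its presence\n",
    "features = {\n",
    "    word: [re.search(word, text) is not None for text in violations]\n",
    "    for word in words\n",
    "}\n",
    "print(features)\n",
    "```\n",
    "\n",
    "Each list in `features` could become a column of a dataframe, with one row per violation."
   ]
  },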
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Text Analysis\n",
    "\n",
    "Sometimes we want to compare entire documents.\n",
    "For example, the US president gives a State of the Union speech every year. Here are the first few lines of the very first speech:"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "*** \n",
    "\n",
    "State of the Union Address\n",
    "George Washington\n",
    "January 8, 1790\n",
    "\n",
    "Fellow-Citizens of the Senate and House of Representatives:\n",
    "I embrace with great satisfaction the opportunity which now presents itself\n",
    "of congratulating you on the present favorable prospects of our public …\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We might wonder: How have the State of the Union speeches changed over time? Do different political parties focus on different topics or use different language in their speeches?\n",
    "To answer these questions, we can transform the speeches into a numeric form\n",
    "that lets us use statistics to compare them."
   ]
  },
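  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One simple numeric form is a table of word counts. As a minimal sketch, and not necessarily the representation used later in this chapter, the following counts the words in the opening sentence shown earlier with `collections.Counter` from the Python standard library. The variable names and the crude whitespace tokenizer are our own assumptions.\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "excerpt = ('I embrace with great satisfaction the opportunity which now '\n",
    "           'presents itself of congratulating you on the present favorable '\n",
    "           'prospects of our public')\n",
    "\n",
    "# Lowercase and split on whitespace; a crude tokenizer for illustration\n",
    "counts = Counter(excerpt.lower().split())\n",
    "print(counts['the'], counts['of'], counts['present'])\n",
    "```\n",
    "\n",
    "Notice that `present` and `presents` are counted as different words. Deciding whether to treat them as the same word is part of preparing text for analysis."
   ]
  },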
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These examples serve to illustrate the ideas of string manipulation, regular\n",
    "expressions, and text analysis. We start by describing simple string manipulation."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
